GH-48251: [C++][CI] Add CSV fuzzing seed corpus generator #48252
pitrou merged 8 commits into apache:main
  GeneratorFactory(ValueType min, ValueType max) : min_(min), max_(max) {}

- auto operator()(pcg32_fast* rng) const {
+ auto operator()(pcg32* rng) const {
It turns out pcg32_fast is not high quality. When used with RandomArrayGenerator::Strings, the first string character would very often be A...
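One way to see how a generator can be weak "relative to the seed" is the simplified sketch below. It is not the pcg library's code: `FirstOutputMcgXshRs` and `CheckFirstOutputsAreSmall` are hypothetical names, and the sketch assumes a multiplicative-congruential-style generator that is seeded directly from a 32-bit seed and derives its first output from that initial state through an XSH-RS style step. Under those assumptions, the first 32-bit output can never reach 1024, which would bias the first drawn value toward the bottom of any range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for an MCG-style generator with an
// XSH-RS style output step. This is an illustration, NOT the pcg library.
inline uint32_t FirstOutputMcgXshRs(uint32_t seed) {
  // Assumption: the 64-bit state is taken directly from the seed (with the
  // two low bits forced to 1), with no extra mixing before the first output.
  const uint64_t state = static_cast<uint64_t>(seed) | 3u;
  // Xorshift the high bits down, then shift right by a state-dependent amount.
  const unsigned shift = 22u + static_cast<unsigned>(state >> 61);
  return static_cast<uint32_t>(((state >> 22) ^ state) >> shift);
}

// With a 32-bit seed the state is below 2^32, so the shift is always 22 and
// only the top 10 bits of the seed survive: the first output is below 1024.
inline void CheckFirstOutputsAreSmall() {
  const uint32_t seeds[] = {0u, 42u, 123456789u, 0xFFFFFFFFu};
  for (uint32_t seed : seeds) {
    assert(FirstOutputMcgXshRs(seed) < 1024u);
  }
}
```

A generator that mixes the seed before producing its first output does not have this property, which is consistent with the switch to `pcg32` discussed here.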
3451a5c to bcce6c5
@github-actions crossbow submit -g cpp
Revision: bcce6c5. Submitted crossbow builds: ursacomputing/crossbow @ actions-e5f01b72d0
bcce6c5 to 7d45596
@github-actions crossbow submit -g cpp
Revision: 7d45596. Submitted crossbow builds: ursacomputing/crossbow @ actions-09618dfadc
zanmato1984 left a comment:
Generally lgtm. Some minor questions.
  ARROW_ASSIGN_OR_RAISE(auto buffer, WriteRecordBatch(batch, options));

  ARROW_ASSIGN_OR_RAISE(auto sample_fn, dir_fn.Join(sample_name()));
  std::cerr << sample_fn.ToString() << std::endl;
Why use standard error rather than standard out?
No precise reason; this is the same thing we're doing in other fuzz corpus generators.
  read_options.block_size = 1000;
  auto parse_options = ParseOptions::Defaults();
  auto convert_options = ConvertOptions::Defaults();
  convert_options.auto_dict_encode = true;
  convert_options.auto_dict_max_cardinality = 50;
Why do we need these changes?
The block_size change increases the likelihood of chunking and the number of chunks, so that chunked reading and parallelization are exercised more. The auto_dict_max_cardinality line just sets the option explicitly to its default value, so it's really a no-op, but it signals a knob that we might want to turn.
For the record, most files generated by this PR are 5-10 kB in size.
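To make the effect of the smaller block size concrete, here is a rough back-of-the-envelope sketch. `ApproxBlockCount` and the two constants are hypothetical names introduced for illustration; the 1 MiB figure is an assumption about the reader's default block size, and a real reader adjusts block boundaries to row boundaries, so the counts are only approximate.

```cpp
#include <cassert>
#include <cstdint>

// Assumed default block size of the CSV reader (1 MiB); an assumption made
// for this illustration, not a value taken from the discussion above.
constexpr int64_t kAssumedDefaultBlockSize = int64_t{1} << 20;
// Block size set in the fuzz target, as quoted in the diff above.
constexpr int64_t kFuzzBlockSize = 1000;

// Approximate number of blocks needed to cover a file: ceil(size / block).
inline int64_t ApproxBlockCount(int64_t file_size, int64_t block_size) {
  assert(file_size >= 0 && block_size > 0);
  return (file_size + block_size - 1) / block_size;
}
```

Under these assumptions, `ApproxBlockCount(5000, kFuzzBlockSize)` gives 5 and `ApproxBlockCount(10000, kFuzzBlockSize)` gives 10, whereas the same files fit in a single block at `kAssumedDefaultBlockSize`. So seed files of 5-10 kB only exercise multi-chunk reading when the block size is reduced.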
7d45596 to 2f092c2
@github-actions crossbow submit fuzz
Revision: 2f092c2. Submitted crossbow builds: ursacomputing/crossbow @ actions-805c4b6939
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit a32730c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 105 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
The CSV seed corpus for fuzzing currently consists of sample data files from the Pandas project and our own testing repository. This PR adds an executable that generates custom seed files with well-defined characteristics designed to exercise the various data types that the CSV reader is able to infer automatically.
This PR also switches the RandomArrayGenerator facility to the non-"fast" PCG random generators, which give better output, especially relative to the seed. This requires some minor changes in the tests to work around some issues that changing the random generator surfaced.
Are these changes tested?
By existing tests.
Are there any user-facing changes?
No.